In Chapter @ref(sampling), we studied sampling. We started with a “tactile” exercise where we wanted to know the proportion of balls in the urn in Figure @ref(fig:sampling-exercise-1) that are red. While we could have performed an exhaustive count, this would have been a tedious process. So instead, we used a shovel to extract a sample of 50 balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the urn’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the urn’s balls that are red.
Remember: There is a truth here. There is an urn. It has red and white balls in it. An exact, but unknown, number of the balls are red. An exact, but unknown, number of the balls are white. An exact, but unknown, percentage of the balls are red – defined as the number red divided by the sum of the number red and the number white. Our goal is to estimate that unknown percentage. We want to make statements about the world, even if we can never be certain that those statements are true. We will never have the time or inclination to actually count all the balls. We use the term parameter for things that exist but which are unknown. We use statistics to estimate the true values of parameters.
We then mimicked this physical sampling exercise with an equivalent virtual sampling exercise using the computer. In Subsection @ref(different-shovels), we repeated this sampling procedure 1,000 times, using three different virtual shovels with 25, 50, and 100 slots. We visualized these three sets of 1,000 estimates in Figure @ref(fig:comparing-sampling-distributions-3) and saw that as the sample size increased, the variation in the estimates decreased. We then expanded this for all sample sizes from 1 to 100.
In doing so, we constructed sampling distributions. The motivation for taking 1,000 repeated samples and visualizing the resulting estimates was to study how these estimates varied from one sample to another; in other words, we wanted to study the effect of sampling variation. We quantified the variation of these estimates using their standard deviation, which has a special name: the standard error. In particular, we saw that as the sample size increased from 1 to 100, the standard error decreased and thus the sampling distributions narrowed. Larger sample sizes led to more precise estimates that varied less around the center.
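This narrowing can be reproduced in a few lines of R. The following is a minimal base-R sketch of the idea, not the book's virtual shovel functions: it builds an urn of 2,400 balls where 37.5% are red, then compares the standard error of the sample proportion across three shovel sizes.

```r
set.seed(76)
# An urn of 2,400 balls: 900 red out of 2,400 is 37.5% red.
urn <- rep(c("red", "white"), times = c(900, 1500))

# Standard error of the sample proportion for a given shovel size,
# estimated from 1,000 repeated samples (without replacement).
standard_error <- function(shovel_size, reps = 1000) {
  p_hats <- replicate(reps, mean(sample(urn, shovel_size) == "red"))
  sd(p_hats)
}

sapply(c(25, 50, 100), standard_error)
# The standard error shrinks as the shovel size grows.
```

The three values decrease as the shovel size increases from 25 to 100, mirroring the narrowing sampling distributions in Figure @ref(fig:comparing-sampling-distributions-3).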
We then tied these sampling exercises to terminology and mathematical notation related to sampling in Subsection @ref(terminology-and-notation). Our study population was the large urn with \(N\) = 2,400 balls, while the population parameter, the unknown quantity of interest, was the population proportion \(p\) of the urn’s balls that were red. Since performing a census would be expensive in terms of time and energy, we instead extracted a sample of size \(n\) = 50. The point estimate, also known as a sample statistic, used to estimate \(p\) was the sample proportion \(\widehat{p}\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and as representative of the population. Thus any results based on the sample could be generalized to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the urn’s balls that are red. In other words, we used the sample to draw inferences about the population.
However, as described in Section @ref(sampling-simulation), neither the physical nor the virtual sampling exercise is what one would do in real life. They were merely activities used to study the effects of sampling variation. In a real-life situation, we would not take 1,000 samples of size \(n\), but rather a single representative sample that's as large as possible. Additionally, we knew that the true proportion of the urn's balls that were red was 37.5%. In a real-life situation, we will not know this value; after all, if we did, why would we take a sample to estimate it?
An example of a realistic sampling situation would be a poll, like the Obama poll you saw in Section @ref(sampling-case-study). Pollsters did not know the true proportion of all young Americans who supported President Obama in 2013, and thus they took a single sample of size \(n\) = 2,089 young Americans to estimate this value.
So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrap resampling.
What if we would like, not only a single estimate of the unknown population parameter, but also a range of highly plausible values? Going back to the Obama poll article, it stated that the pollsters’ estimate of the proportion of all young Americans who supported President Obama was 41%. But in addition it stated that the poll’s “margin of error was plus or minus 2.1 percentage points.” This “plausible range” was [41% - 2.1%, 41% + 2.1%] = [38.9%, 43.1%]. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this chapter.
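The arithmetic behind this plausible range is straightforward to reproduce in R:

```r
# Reproduce the poll's plausible range: point estimate +/- margin of error.
point_estimate  <- 0.41    # 41% supported President Obama
margin_of_error <- 0.021   # plus or minus 2.1 percentage points

c(lower = point_estimate - margin_of_error,
  upper = point_estimate + margin_of_error)
## lower upper
## 0.389 0.431
```

These endpoints match the poll's stated range of [38.9%, 43.1%]. Later in this chapter we'll see how to construct such an interval ourselves, rather than being handed a margin of error.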
Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section @ref(tidyverse-package) that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once:
If needed, read Section @ref(packages) for information on how to install and load R packages.
library(tidyverse)
As we did in Chapter @ref(sampling), we’ll begin with a hands-on tactile activity.
Try to imagine all the pennies being used in the United States in 2019. That’s a lot of pennies! Now say we’re interested in the average year of minting of all these pennies. One way to compute this value would be to gather up all pennies being used in the US, record the year, and compute the average. However, this would be near impossible! So instead, let’s collect a sample of 50 pennies from a local bank in downtown Northampton, Massachusetts, USA as seen in Figure @ref(fig:resampling-exercise-a).
Collecting a sample of 50 US pennies from a local bank.
An image of these 50 pennies can be seen in Figure @ref(fig:resampling-exercise-c). For each of the 50 pennies starting in the top left, progressing row-by-row, and ending in the bottom right, note there is an “ID” identification variable printed in black and the year of minting printed in white.
50 US pennies labelled.
The moderndive package contains this data on our 50 sampled pennies in the pennies_sample data frame:
library(moderndive)
pennies_sample
## # A tibble: 50 x 2
## ID year
## <int> <dbl>
## 1 1 2002
## 2 2 1986
## 3 3 2017
## 4 4 1988
## 5 5 2008
## 6 6 1983
## 7 7 2008
## 8 8 1996
## 9 9 2004
## 10 10 2000
## # … with 40 more rows
The pennies_sample data frame has 50 rows corresponding to each penny with two variables. The first variable ID corresponds to the ID labels in Figure @ref(fig:resampling-exercise-c), whereas the second variable year corresponds to the year of minting saved as a numeric variable, also known as a double (dbl).
Based on these 50 sampled pennies, what can we say about all US pennies in 2019? Let’s study some properties of our sample by performing an exploratory data analysis. Let’s first visualize the distribution of the year of these 50 pennies using our data visualization tools from Chapter @ref(viz). Since year is a numerical variable, we use a histogram in Figure @ref(fig:pennies-sample-histogram) to visualize its distribution.
pennies_sample %>%
ggplot(aes(x = year)) +
geom_histogram(binwidth = 10, color = "white")
Distribution of year on 50 US pennies.
Observe the slightly left-skewed distribution: most pennies fall between 1980 and 2010, with only a few minted before 1970. What is the average year for the 50 sampled pennies? Eyeballing the histogram, it appears to be around 1990. Let's now compute this value exactly using our data wrangling tools from Chapter @ref(wrangling).
pennies_sample %>%
summarize(mean_year = mean(year))
## # A tibble: 1 x 1
## mean_year
## <dbl>
## 1 1995.
Thus, if we’re willing to assume that pennies_sample is a representative sample from all US pennies, a “good guess” of the average year of minting of all US pennies would be 1995.44. In other words, around 1995. This should all start sounding similar to what we did previously in Chapter @ref(sampling)!
In Chapter @ref(sampling), our study population was the urn of \(N\) = 2,400 balls. Our population parameter was the population proportion of these balls that were red, denoted by \(p\). In order to estimate \(p\), we extracted a sample of 50 balls using the shovel. We then computed the relevant point estimate: the sample proportion of these 50 balls that were red, denoted mathematically by \(\widehat{p}\).
Here our population is \(N\) = however many pennies are being used in the US, a value which we don't know and probably never will. The population parameter of interest is now the population mean year of all these pennies, a value denoted mathematically by the Greek letter \(\mu\) (pronounced “mu”). In order to estimate \(\mu\), we went to the bank and obtained a sample of 50 pennies and computed the relevant point estimate: the sample mean year of these 50 pennies, denoted mathematically by \(\overline{x}\) (pronounced “x-bar”). An alternative and more intuitive notation for the sample mean is \(\widehat{\mu}\). However, this is unfortunately not as commonly used, so in this book we'll stick with convention and always denote the sample mean as \(\overline{x}\).
We summarize the correspondence between the sampling urn exercise in Chapter @ref(sampling) and our pennies exercise in Table @ref(tab:table-ch8-b).
| Scenario | Population parameter | Notation | Point estimate | Symbol(s) |
|---|---|---|---|---|
| 1 | Population proportion | \(p\) | Sample proportion | \(\widehat{p}\) |
| 2 | Population mean | \(\mu\) | Sample mean | \(\overline{x}\) or \(\widehat{\mu}\) |
Going back to our 50 sampled pennies in Figure @ref(fig:resampling-exercise-c), the point estimate of interest is the sample mean \(\overline{x}\) of 1995.44. This quantity is an estimate of the population mean year of all US pennies \(\mu\).
Recall that we also saw in Chapter @ref(sampling) that such estimates are prone to sampling variation. For example, in this particular sample in Figure @ref(fig:resampling-exercise-c), we observed three pennies with the year 1999. If we sampled another 50 pennies, would we observe exactly three pennies with the year 1999 again? More than likely not. We might observe none, one, two, or maybe even all 50! The same can be said for the other 26 unique years that are represented in our sample of 50 pennies.
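These repeat counts are easy to tally directly. For instance, using the count() verb from dplyr (loaded with the tidyverse) on pennies_sample:

```r
library(tidyverse)
library(moderndive)

# Tally how many of the 50 pennies share each minting year.
pennies_sample %>%
  count(year, sort = TRUE)
```

The resulting counts must sum to 50, and the number of rows tells us how many distinct minting years appear in our sample.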
To study the effects of sampling variation in Chapter @ref(sampling), we took many samples, something we could easily do with our shovel. In our case with pennies, however, how would we obtain another sample? By going to the bank and getting another roll of 50 pennies.
Say we’re feeling lazy, however, and don’t want to go back to the bank. How can we study the effects of sampling variation using our single sample? We will do so using a technique known as bootstrap resampling with replacement, which we now illustrate.
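Computationally, the core move is drawing a new sample with replacement from our original sample. The tactile steps that follow act out this procedure by hand; here is a minimal sketch in R, using a hypothetical vector of 50 minting years as a stand-in rather than the actual pennies_sample data:

```r
# Hypothetical stand-in for 50 minting years (NOT the real pennies_sample).
set.seed(2019)
years <- sample(1960:2018, size = 50, replace = TRUE)

# One bootstrap resample: draw 50 years from our 50, WITH replacement,
# so some years can appear multiple times and others not at all.
resample <- sample(years, size = 50, replace = TRUE)

mean(years)     # original sample mean
mean(resample)  # one bootstrap estimate of that mean

# Repeating many times yields a bootstrap distribution of the sample mean.
boot_means <- replicate(1000, mean(sample(years, size = 50, replace = TRUE)))
sd(boot_means)  # the bootstrap standard error
```

Note that `replace = TRUE` is what distinguishes bootstrap resampling from the shovel exercise in Chapter @ref(sampling), where each ball could be drawn at most once per sample.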
Step 1: Let’s print out identically sized slips of paper representing our 50 pennies as seen in Figure @ref(fig:tactile-resampling-1).
Step 1: 50 slips of paper representing 50 US pennies.
Step 2: Put the 50 slips of paper into a hat or tuque as seen in Figure @ref(fig:tactile-resampling-2).